{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WoegBf2gjiW4"
      },
      "source": [
        "> **NOTE**: This tutorial will demonstrate how to utilize the GPU provided by Colab to run LLM with Xinference local server, and how to interact with the model in different ways (OpenAI-Compatible endpoints/Xinference's builtin Client/LangChain).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FAhwDgtUGIEo"
      },
      "source": [
        "# Xinference\n",
        "\n",
        "Xorbits Inference (Xinference) is an open-source platform to streamline the operation and integration of a wide array of AI models. With Xinference, you’re empowered to run inference using any open-source LLMs, embedding models, and multimodal models either in the cloud or on your own premises, and create robust AI-driven applications.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WzJegrOpGH4N"
      },
      "source": [
        "\n",
        "* [Docs](https://inference.readthedocs.io/en/latest/index.html)\n",
        "* [Built-in Models](https://inference.readthedocs.io/en/latest/models/builtin/index.html)\n",
        "* [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)\n",
        "* [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)\n",
        "* [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LovckG0kGr9j"
      },
      "source": [
        "## Set up the environment\n",
        "\n",
        "> **NOTE**: We recommend you run this demo on a GPU. To change the runtime type: In the toolbar menu, click **Runtime** > **Change runtime typ**e > **Select the GPU (T4)**\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bPDeDltCGABt"
      },
      "source": [
        "### Check memory and GPU resources"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qhoItBBhF7uY",
        "outputId": "99209d5a-78a2-405b-c8b2-4b840ecc78f1"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "RAM: 12.67 GB\n",
            "=============GPU INFO=============\n",
            "Wed Jan  3 08:02:08 2024       \n",
            "+---------------------------------------------------------------------------------------+\n",
            "| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |\n",
            "|-----------------------------------------+----------------------+----------------------+\n",
            "| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
            "| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n",
            "|                                         |                      |               MIG M. |\n",
            "|=========================================+======================+======================|\n",
            "|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |\n",
            "| N/A   62C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |\n",
            "|                                         |                      |                  N/A |\n",
            "+-----------------------------------------+----------------------+----------------------+\n",
            "                                                                                         \n",
            "+---------------------------------------------------------------------------------------+\n",
            "| Processes:                                                                            |\n",
            "|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |\n",
            "|        ID   ID                                                             Usage      |\n",
            "|=======================================================================================|\n",
            "|  No running processes found                                                           |\n",
            "+---------------------------------------------------------------------------------------+\n"
          ]
        }
      ],
      "source": [
        "import psutil\n",
        "import torch\n",
        "\n",
        "\n",
        "ram = psutil.virtual_memory()\n",
        "ram_total = ram.total / (1024**3)\n",
        "print('RAM: %.2f GB' % ram_total)\n",
        "\n",
        "print('=============GPU INFO=============')\n",
        "if torch.cuda.is_available():\n",
        "  !/opt/bin/nvidia-smi || ture\n",
        "else:\n",
        "  print('GPU NOT available')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eFzlnU4gG_JL"
      },
      "source": [
        "### Install Xinference and dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "1eReutJA_jS_"
      },
      "outputs": [],
      "source": [
        "%pip install -U -q typing_extensions==4.5.0 xinference[transformers] openai langchain"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vPq_TWiRQCAA",
        "outputId": "4076a2ed-d42d-43ab-8e5d-cd57ffdab7ba"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Name: xinference\n",
            "Version: 0.7.4.1\n",
            "Summary: Model Serving Made Easy\n",
            "Home-page: https://github.com/xorbitsai/inference\n",
            "Author: Qin Xuye\n",
            "Author-email: qinxuye@xprobe.io\n",
            "License: Apache License 2.0\n",
            "Location: /usr/local/lib/python3.10/dist-packages\n",
            "Requires: click, fastapi, fsspec, gradio, huggingface-hub, modelscope, openai, pydantic, requests, s3fs, sse-starlette, tabulate, torch, tqdm, typing-extensions, uvicorn, xoscar\n",
            "Required-by: \n"
          ]
        }
      ],
      "source": [
        "!pip show xinference"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2EACA0GYHm2o"
      },
      "source": [
        "## A Quick Start Demo\n",
        "### Start Local Server\n",
        "\n",
        "\n",
        "To start a local instance of Xinference, run `xinference` in the background via `nohup`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "5EM01Gq7IQ2y"
      },
      "outputs": [],
      "source": [
        "!nohup xinference-local  > xinference.log 2>&1 &"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WXtUJSC3I3kh"
      },
      "source": [
        "Congrats! You now have Xinference running in Colab machine. The default host and ip is 127.0.0.1 and 9997 respectively.\n",
        "\n",
        "\n",
        "Once Xinference is running, there are multiple ways we can try it: via the web UI, via cURL, via the command line, or via the Xinference’s python client."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0mkyrGIHJekz"
      },
      "source": [
        "The command line tool is `xinference`. You can list the commands that can be used by running:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "yayFuLIgJhYX",
        "outputId": "56a696ee-b37d-44b6-9f07-39f18cffd099"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Usage: xinference [OPTIONS] COMMAND [ARGS]...\n",
            "\n",
            "  Xinference command-line interface for serving and deploying models.\n",
            "\n",
            "Options:\n",
            "  -v, --version       Show the current version of the Xinference tool.\n",
            "  --log-level TEXT    Set the logger level. Options listed from most log to\n",
            "                      least log are: DEBUG > INFO > WARNING > ERROR > CRITICAL\n",
            "                      (Default level is INFO)\n",
            "  -H, --host TEXT     Specify the host address for the Xinference server.\n",
            "  -p, --port INTEGER  Specify the port number for the Xinference server.\n",
            "  --help              Show this message and exit.\n",
            "\n",
            "Commands:\n",
            "  chat           Chat with a running LLM.\n",
            "  generate       Generate text using a running LLM.\n",
            "  launch         Launch a model with the Xinference framework with the...\n",
            "  list           List all running models in Xinference.\n",
            "  register       Registers a new model with Xinference for deployment.\n",
            "  registrations  Lists all registered models in Xinference.\n",
            "  terminate      Terminate a deployed model through unique identifier...\n",
            "  unregister     Unregisters a model from Xinference, removing it from...\n"
          ]
        }
      ],
      "source": [
        "!xinference --help"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wvhIEjcHKXc5"
      },
      "source": [
        "### Run Qwen-Chat\n",
        "\n",
        "Xinference supports a variety of LLMs. Learn more in https://inference.readthedocs.io/en/latest/models/builtin/.\n",
        "\n",
        "Let’s start by running a built-in model: `Qwen-1_8B-Chat`.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z7OyMw8sKjj6"
      },
      "source": [
        "We can specify the model’s UID using the `--model-uid` or `-u` flag. If not specified, Xinference will generate it. This create a new model instance with unique ID `my-llvm`:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "B_hQFqxOKiww",
        "outputId": "e46d4138-20c5-4238-e131-f48bd0cd6e94"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Model uid: my-llm\n"
          ]
        }
      ],
      "source": [
        "!xinference launch -u my-llm -n qwen-chat -s 1_8 -f pytorch"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q--ic56eNDyo"
      },
      "source": [
        "When you start a model for the first time, Xinference will download the model parameters from HuggingFace, which might take a few minutes depending on the size of the model weights. We cache the model files locally, so there’s no need to redownload them for subsequent starts.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cfF-cCFlMCvE"
      },
      "source": [
        "## Interact with the running model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MYKNW0c-MONc"
      },
      "source": [
        "Congrats! You now have the model running by Xinference. Once the model is running, we can try it out either command line, via cURL, or via Xinference’s python client:\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VfZZham7Lj3X"
      },
      "source": [
        "### 1.Use the OpenAI compatible endpoint\n",
        "\n",
        "Xinference provides OpenAI-compatible APIs for its supported models, so you can use Xinference as a local drop-in replacement for OpenAI APIs. For example:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GOStrwtRLehN",
        "outputId": "3a67ba15-271a-4841-bdd2-d260a4ebb0ff"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "ChatCompletion(id='chat899575cc-aa0e-11ee-9dba-0242ac1c000c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='I am an AI language model created by Alibaba Cloud. I have been trained on a vast amount of text data and can answer questions, provide suggestions, and engage in conversations with users. How may I assist you today?', role='assistant', function_call=None, tool_calls=None))], created=1704268990, model='my-llm', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=44, prompt_tokens=23, total_tokens=67))"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import openai\n",
        "\n",
        "messages=[\n",
        "    {\n",
        "        \"role\": \"user\",\n",
        "        \"content\": \"Who are you?\"\n",
        "    }\n",
        "]\n",
        "\n",
        "client = openai.Client(api_key=\"empty\", base_url=f\"http://0.0.0.0:9997/v1\")\n",
        "client.chat.completions.create(\n",
        "    model=\"my-llm\",\n",
        "    messages=messages,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AmYB-_K4aXnG"
      },
      "source": [
        "### 2. Send request using curl\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "n7VLGirDaaR3",
        "outputId": "ff2d2a17-1e7f-46dc-818f-72f520ee5607"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{\"id\":\"chat8bd9a524-aa0e-11ee-9dba-0242ac1c000c\",\"object\":\"chat.completion\",\"created\":1704268994,\"model\":\"my-llm\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"It is difficult to determine which animal is the largest as there is no single animal that can be considered the \\\"largest\\\" in all senses. The size of an animal can vary greatly depending on its habitat, species, and individual characteristics.\\n\\nFor example, giant pandas are known for their large size, with males weighing up to 200 pounds and females weighing up to 135 pounds. Other large animals include blue whales, the world's largest animal, with estimated populations ranging from over 100,000 individuals; elephants, which can weigh over 6,000 pounds and stand up to \"},\"finish_reason\":\"length\"}],\"usage\":{\"prompt_tokens\":25,\"completion_tokens\":127,\"total_tokens\":152}}"
          ]
        }
      ],
      "source": [
        "!curl -k -X 'POST' -N \\\n",
        "  'http://127.0.0.1:9997/v1/chat/completions' \\\n",
        "  -H 'accept: application/json' \\\n",
        "  -H 'Content-Type: application/json' \\\n",
        "  -d '{ \"model\": \"my-llm\", \"messages\": [ {\"role\": \"system\", \"content\": \"You are a helpful assistant.\" }, {\"role\": \"user\", \"content\": \"What is the largest animal?\"} ]}'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RJ_72F51XFZY"
      },
      "source": [
        "### 3. Use Xinference's Python client"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ohZPPubkXKLl",
        "outputId": "51ae6581-216c-4b30-8b88-0965193cd091"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'id': 'chat8cef808c-aa0e-11ee-9dba-0242ac1c000c',\n",
              " 'object': 'chat.completion',\n",
              " 'created': 1704268996,\n",
              " 'model': 'my-llm',\n",
              " 'choices': [{'index': 0,\n",
              "   'message': {'role': 'assistant',\n",
              "    'content': \"Hello! I'm here to help answer any questions you may have. Is there something specific you would like to know about animals or anything else?\"},\n",
              "   'finish_reason': 'stop'}],\n",
              " 'usage': {'prompt_tokens': 31, 'completion_tokens': 29, 'total_tokens': 60}}"
            ]
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from xinference.client import RESTfulClient\n",
        "client = RESTfulClient(\"http://127.0.0.1:9997\")\n",
        "model = client.get_model(\"my-llm\")\n",
        "model.chat(\n",
        "    prompt=\"hello\",\n",
        "    chat_history=[\n",
        "    {\n",
        "        \"role\": \"user\",\n",
        "        \"content\": \"What is the largest animal?\"\n",
        "    }]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P4PaU0fpdAuB"
      },
      "source": [
        "### 4. Langchain intergration"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ijeadB9DdDO8",
        "outputId": "1b8969cc-5d05-4cee-fac6-d8aa7d0efbd4"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            " The answer to this question is subjective and can vary depending on factors such as the definition of \"largest\" and location. However, some estimates put the size of a particular plant at over 50 feet tall or more.\n",
            "\n",
            "One example of such a large plant is the giant sequoia tree ( Sequoia sempervirens), which is found only in California, USA. The tree has a diameter of up to 138 feet and can grow up to 600 feet tall. It is considered one of the oldest living organisms on Earth and is estimated to be over 27 million years old.\n",
            "\n",
            "Another\n"
          ]
        }
      ],
      "source": [
        "from langchain.llms import Xinference\n",
        "from langchain.chains import LLMChain\n",
        "from langchain.prompts import PromptTemplate\n",
        "\n",
        "llm = Xinference(server_url='http://127.0.0.1:9997', model_uid='my-llm')\n",
        "\n",
        "template = 'What is the largest {kind} on the earth?'\n",
        "\n",
        "prompt = PromptTemplate(template=template, input_variables=['kind'])\n",
        "\n",
        "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
        "\n",
        "generated = llm_chain.run(kind='plant')\n",
        "print(generated)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
